[recipe][perf] feat: unified flat performance recipes for all model families #2803
yaoyu-33 wants to merge 54 commits into
Conversation
|
/claude review |
Overall this looks good: clean consolidation of the perf recipes. Two items:
- Bug in equivalence test: `_assert_configs_equal` uses `is not` (identity check) instead of `!=` for the `tp_comm_overlap_cfg` comparison. This is a latent bug that will produce false failures once any recipe sets a non-None `tp_comm_overlap_cfg` on both old and new paths.
- Stale validation report: `docs/proposals/validate-perf-recipe-parity.md` shows 107/255 passing with 116 mismatches, but the PR description says 245/245 pass. The doc captures a pre-fix snapshot; consider updating it to reflect the final state, or noting prominently that it's a historical record, so future readers don't think half the recipes are broken.
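To make the first item concrete, here is a minimal sketch of why an identity check misfires on two independently built config objects. `OverlapCfg` and its fields are hypothetical stand-ins, not the real `tp_comm_overlap_cfg` type; the only assumption is that the real object defines value equality the way a dataclass does.

```python
from dataclasses import dataclass


@dataclass
class OverlapCfg:
    """Hypothetical stand-in for a tp_comm_overlap_cfg object."""

    method: str = "ring_exchange"
    num_sm: int = 16


# The old and new recipe paths each build their own object with equal contents.
old_cfg = OverlapCfg()
new_cfg = OverlapCfg()

# Identity check: two separately constructed objects are never the same object,
# so this flags a "difference" even though every field matches.
identity_says_different = old_cfg is not new_cfg

# Equality check: dataclasses generate __eq__, so this compares field values.
equality_says_different = old_cfg != new_cfg

print(identity_says_different)  # True  (false failure)
print(equality_says_different)  # False (configs really are equal)
```

The identity check only happens to work while one or both sides are `None`, which is why the bug stays latent until a recipe sets the field on both paths.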
📝 Walkthrough
This pull request introduces a comprehensive "flat perf recipes" framework for performance benchmarking across multiple model families. It adds convention-based recipe functions to replace legacy pipeline configs, implements a new recipe-loading mechanism in the performance runner, establishes a common benchmarking utility, and provides extensive test coverage to validate equivalence between old and new implementations.
Changes
Sequence Diagram(s)
sequenceDiagram
participant User as Performance Runner
participant Script as run_script.py
participant FlatRecipe as Flat Recipe Module
participant Legacy as Legacy Pipeline
participant Common as common._benchmark_common()
User->>Script: main(model_recipe_name, task, config_variant, ...)
Script->>Script: construct recipe function name
Script->>FlatRecipe: get_perf_recipe_by_name()
alt Flat recipe found
FlatRecipe-->>FlatRecipe: load + call recipe function
FlatRecipe-->>Common: _benchmark_common(cfg)
Common-->>Common: apply benchmark overrides (iters, logging, checks)
FlatRecipe-->>Script: return ConfigContainer
Script->>Script: using_flat_recipe = True
else Flat recipe not found
Script->>Legacy: get_perf_optimized_recipe()
Legacy-->>Script: return ConfigContainer
Script->>Script: set_post_overrides(cfg)
Script->>Script: using_flat_recipe = False
end
Script-->>User: configured ConfigContainer
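The control flow in the diagram can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `run_script.py`: the registry, `legacy_recipe`, the dict-based config, and the name pattern (inferred from recipe names quoted elsewhere in this review) are all hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical registry standing in for the flat recipe modules.
FLAT_RECIPES: Dict[str, Callable[[], dict]] = {
    "llama3_8b_pretrain_8gpu_h100_bf16_config": lambda: {"source": "flat"},
}


def get_flat_recipe(name: str) -> Optional[Callable[[], dict]]:
    """Return the flat recipe builder for this name, or None if absent."""
    return FLAT_RECIPES.get(name)


def legacy_recipe(model: str, task: str) -> dict:
    """Hypothetical stand-in for the legacy pipeline builder."""
    return {"source": "legacy", "model": model, "task": task}


def set_post_overrides(cfg: dict) -> dict:
    """Hypothetical stand-in for launch-time overrides on the legacy path."""
    cfg["post_overrides"] = True
    return cfg


def load_recipe(model: str, task: str, num_gpus: int, gpu: str, precision: str) -> Tuple[dict, bool]:
    """Try the flat recipe first; fall back to the legacy pipeline."""
    # Assumed naming convention, modelled on names such as
    # gpt_oss_120b_pretrain_64gpu_gb300_bf16_config.
    name = f"{model}_{task}_{num_gpus}gpu_{gpu}_{precision}_config"
    recipe_fn = get_flat_recipe(name)
    if recipe_fn is not None:
        return recipe_fn(), True  # using_flat_recipe = True
    return set_post_overrides(legacy_recipe(model, task)), False


flat_cfg, flat_used = load_recipe("llama3_8b", "pretrain", 8, "h100", "bf16")
legacy_cfg, legacy_flat_used = load_recipe("unknown_model", "pretrain", 8, "h100", "bf16")
```

Note that in this shape the legacy branch is the only one that runs `set_post_overrides`, which is the behavior one of the inline comments in this review questions.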
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py (1)
471-477: ⚠️ Potential issue | 🟠 Major
Bug: `min_lr` is greater than `max_lr` in the optimizer config.
The learning rate schedule has `max_lr=5e-6` and `min_lr=3e-5`, but `min_lr` (3e-5) is actually larger than `max_lr` (5e-6). This will cause unexpected behavior in cosine annealing where the LR increases rather than decreases during decay.
🐛 Proposed fix
     opt_cfg, scheduler_cfg = distributed_fused_adam_with_cosine_annealing(
         lr_warmup_iters=500,
         lr_decay_iters=300000,
         max_lr=5e-6,
    -    min_lr=3e-5,
    +    min_lr=5e-7,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py` around lines 471 - 477, The optimizer schedule is configured with min_lr (3e-5) greater than max_lr (5e-6) which inverts the intended cosine annealing behavior; update the call to distributed_fused_adam_with_cosine_annealing (the assignment to opt_cfg, scheduler_cfg) so that max_lr is larger than min_lr (e.g., swap the values or set max_lr to 3e-5 and min_lr to 5e-6) to ensure learning rate decays correctly.
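A small sketch of the effect described above, using the textbook cosine annealing formula rather than Megatron's actual scheduler (which may differ in warmup handling and other details). `cosine_lr` and `check_lr_bounds` are hypothetical helpers written for this illustration.

```python
import math


def cosine_lr(step: int, decay_steps: int, max_lr: float, min_lr: float) -> float:
    """Textbook cosine annealing: max_lr at step 0, min_lr at decay_steps."""
    progress = min(step, decay_steps) / decay_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def check_lr_bounds(max_lr: float, min_lr: float) -> None:
    """Hypothetical guard a recipe could run before building the scheduler."""
    if min_lr > max_lr:
        raise ValueError(f"min_lr ({min_lr}) must not exceed max_lr ({max_lr})")


# Values from the flagged recipe: the "decay" runs from 5e-6 up to 3e-5.
start = cosine_lr(0, 300_000, max_lr=5e-6, min_lr=3e-5)
end = cosine_lr(300_000, 300_000, max_lr=5e-6, min_lr=3e-5)
print(start < end)  # True: the learning rate rises instead of decaying

try:
    check_lr_bounds(max_lr=5e-6, min_lr=3e-5)
    bounds_rejected = False
except ValueError:
    bounds_rejected = True
```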
🧹 Nitpick comments (5)
src/megatron/bridge/recipes/gpt_oss/__init__.py (1)
30-44: Minor: Comment placement could be clearer.
The `# V1` comment on line 35 appears after `gb200` but before `gb300`, which is inconsistent with the `__all__` ordering where `gb300` comes first in the V1 section. Consider moving the comment to line 31 for consistency.
📝 Suggested reordering
     # GPT-OSS perf recipes
     from .gpt_oss_perf import (
    +    # V1
    +    gpt_oss_120b_pretrain_64gpu_gb300_bf16_config,
    +    gpt_oss_120b_pretrain_64gpu_gb200_bf16_config,
    +    gpt_oss_120b_pretrain_64gpu_b300_bf16_config,
         gpt_oss_120b_pretrain_64gpu_b200_bf16_config,
    -    gpt_oss_120b_pretrain_64gpu_b300_bf16_config,
    -    gpt_oss_120b_pretrain_64gpu_gb200_bf16_config,
    -    # V1
    -    gpt_oss_120b_pretrain_64gpu_gb300_bf16_config,
         gpt_oss_120b_pretrain_64gpu_h100_bf16_config,
    +    # V2
    +    gpt_oss_120b_pretrain_v2_64gpu_gb300_bf16_config,
    +    gpt_oss_120b_pretrain_v2_64gpu_gb200_bf16_config,
    +    gpt_oss_120b_pretrain_v2_64gpu_b300_bf16_config,
         gpt_oss_120b_pretrain_v2_64gpu_b200_bf16_config,
    -    gpt_oss_120b_pretrain_v2_64gpu_b300_bf16_config,
    -    gpt_oss_120b_pretrain_v2_64gpu_gb200_bf16_config,
    -    # V2
    -    gpt_oss_120b_pretrain_v2_64gpu_gb300_bf16_config,
         gpt_oss_120b_pretrain_v2_64gpu_h100_bf16_config,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/gpt_oss/__init__.py` around lines 30 - 44, Move the inline section comment so the V1/V2 grouping matches the import order: relocate the "# V1" comment to immediately before gpt_oss_120b_pretrain_64gpu_gb300_bf16_config (so it sits above gpt_oss_120b_pretrain_64gpu_gb300_bf16_config and gpt_oss_120b_pretrain_64gpu_h100_bf16_config) and ensure the "# V2" comment remains before gpt_oss_120b_pretrain_v2_64gpu_gb300_bf16_config; this aligns the comments with the imports such as gpt_oss_120b_pretrain_64gpu_gb300_bf16_config, gpt_oss_120b_pretrain_64gpu_h100_bf16_config, gpt_oss_120b_pretrain_v2_64gpu_gb300_bf16_config, etc.
src/megatron/bridge/recipes/nemotronh/__init__.py (1)
47-74: Minor: Import ordering could be cleaner.
The NemotronH 56B imports (lines 68-73) are placed after the Nemotron 3 Nano H100 imports, making the grouping inconsistent. Consider grouping all NemotronH 56B imports together at the start, followed by all Nemotron 3 Nano imports.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/nemotronh/__init__.py` around lines 47 - 74, The import block mixes NemotronH 56B symbols with Nemotron 3 Nano H100 entries; reorder the import list in __init__.py so all NemotronH 56B configs (nemotronh_56b_pretrain_64gpu_b200_fp8cs_config, nemotronh_56b_pretrain_64gpu_b300_fp8cs_config, nemotronh_56b_pretrain_64gpu_gb200_fp8cs_config, nemotronh_56b_pretrain_64gpu_gb300_fp8cs_config, nemotronh_56b_pretrain_64gpu_h100_fp8cs_config) are grouped together (e.g., at the start of the block) and then group all Nemotron 3 Nano configs (nemotron_3_nano_pretrain_*_config) afterwards to keep imports consistently organized and readable.
tests/unit_tests/recipes/llama/test_llama3_perf.py (1)
90-101: This parity test does not exercise the new loader path.
The suite compares builder outputs directly via `get_perf_optimized_recipe()` and `new_fn()`, so regressions in `scripts/performance/run_script.py` still go green. The two loader bugs in this PR branch change behavior after recipe selection, and this file never hits that code.
Also applies to: 306-319
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/recipes/llama/test_llama3_perf.py` around lines 90 - 101, The parity test currently compares outputs from get_perf_optimized_recipe and new_fn directly and therefore never executes the new loader code path; update the test to invoke the actual loader entrypoint used by the new pipeline (the function that run_script uses to load and return a recipe) instead of calling new_fn directly so the loader logic runs; specifically, replace or augment calls to get_perf_optimized_recipe/new_fn in test_llama3_perf.py (and the similar block around lines 306-319) to construct a recipe via the new pipeline loader entrypoint, passing the same model, gpu and precision parameters, then compare the final loaded recipe objects to exercise the new loader path and catch the reported regressions.
src/megatron/bridge/recipes/qwen_vl/qwen3_vl_perf.py (1)
25-30: Use the Qwen-VL base entrypoints here.
These imports pull the text-only configs from `src/megatron/bridge/recipes/qwen/qwen3_moe.py` instead of the family-local wrappers in `src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py`. Today they delegate to the same base, but keeping the perf recipes on the Qwen-VL entrypoints is what preserves any future VLM-specific defaults or backend handling.
♻️ Suggested direction
    -from megatron.bridge.recipes.qwen.qwen3_moe import (
    -    qwen3_30b_a3b_pretrain_config,
    -    qwen3_235b_a22b_pretrain_config,
    -)
    +from .qwen3_vl import (
    +    qwen3_vl_30b_a3b_pretrain_config,
    +    qwen3_vl_235b_a22b_pretrain_config,
    +)
Then update the call sites below to use `qwen3_vl_*_pretrain_config()`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/qwen_vl/qwen3_vl_perf.py` around lines 25 - 30, Replace the direct imports of text-only Qwen configs with the Qwen-VL family entrypoints: instead of importing qwen3_30b_a3b_pretrain_config and qwen3_235b_a22b_pretrain_config from qwen.qwen3_moe, import their Qwen-VL counterparts from qwen_vl.qwen3_vl (use qwen3_vl_30b_a3b_pretrain_config and qwen3_vl_235b_a22b_pretrain_config). Then update any call sites in this file that construct or pass configs to use qwen3_vl_*_pretrain_config() (keep existing uses of _perf_precision and _benchmark_common unchanged).
src/megatron/bridge/recipes/llama/llama3_perf.py (1)
55-71: Promote `_perf_precision()` to the shared recipe helpers.
`src/megatron/bridge/recipes/qwen_vl/qwen3_vl_perf.py` already imports this helper, so the precision preset logic now lives behind a Llama-specific module boundary. Moving it alongside `_benchmark_common()` in `src/megatron/bridge/recipes/common.py` keeps the dependency direction clean and avoids making other families import the whole Llama perf module.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/llama/llama3_perf.py` around lines 55 - 71, Move the _perf_precision function out of the Llama-specific file into the shared recipe helpers alongside _benchmark_common in the common recipe module so other model families can import it without depending on the Llama perf module; specifically, relocate the function named _perf_precision (and its use-sites such as imports in qwen3_vl_perf.py) into src/megatron/bridge/recipes/common.py, preserving the logic that returns bf16_mixed(), bf16_with_fp8_current_scaling_mixed() (with cfg.first_last_layers_bf16 = False), bf16_with_mxfp8_mixed(), and bf16_with_nvfp4_mixed(), update any imports that referenced _perf_precision from llama3_perf.py to import it from the common helper, and ensure no other Llama-specific symbols are moved or referenced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/proposals/validate-perf-recipe-parity.md`:
- Around line 7-208: The parity report in validate-perf-recipe-parity.md is
stale (shows 107/255 passed, 116 mismatched, 22 import errors) — either update
the document to reflect the new final result (245 of 245 parity run) by changing
the Summary counts, removing or collapsing obsolete Category 3/4 sections and
the Root Cause/Next Steps that are already addressed, and note the 10 DeepSeek
skips; or delete or archive the file if you don't want historical pre-fix details
in docs; search for the document title "validate-perf-recipe-parity.md" and edit
its Summary, tables, and category sections (and any references to
_benchmark_common(), qwen3_vl_30b_a3b_pretrain_config, DeepSeek entries) to
match the current successful run.
In `@scripts/performance/run_script.py`:
- Around line 31-78: get_perf_recipe_by_name currently builds and calls flat
recipe builders with no arguments so --optimizer-type is ignored for Kimi;
update get_perf_recipe_by_name signature to accept an optimizer_type parameter
and thread it when invoking the recipe function (e.g., call
recipe_fn(optimizer_type=optimizer_type) if the builder accepts it), update main
to pass args.optimizer_type into get_perf_recipe_by_name when using the flat
path, and update the Kimi flat recipe builder functions to accept an
optimizer_type arg and forward it into kimi_k2_pretrain_config(...) so the
selected optimizer is honored.
- Around line 113-124: The code currently skips calling set_post_overrides when
using_flat_recipe is true, which omits important launch-time mutations (e.g.,
BF16 + Adam precision-aware optimizer, overlap_param_gather_with_optimizer_step,
and GPU-count-based GBS scaling). Instead of skipping set_post_overrides
entirely, extract the non-workload-dependent mutations from set_post_overrides
into a small helper (e.g., apply_launch_overrides) and call that helper for both
flat and legacy recipes; keep workload-dependent logic in set_post_overrides and
only skip that part for flat recipes, ensuring recipe still receives
BF16/optimizer/overlap/GBS adjustments even when using_flat_recipe is true.
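One way to implement "call `recipe_fn(optimizer_type=optimizer_type)` if the builder accepts it" is to inspect the builder's signature before calling it. The sketch below is only an illustration: `call_recipe`, both example builders, and the `"muon"` value are hypothetical, and the real loader may prefer a different mechanism.

```python
import inspect
from typing import Any, Callable, Optional


def call_recipe(recipe_fn: Callable[..., Any], optimizer_type: Optional[str] = None) -> Any:
    """Forward optimizer_type only to builders that declare that parameter."""
    params = inspect.signature(recipe_fn).parameters
    if optimizer_type is not None and "optimizer_type" in params:
        return recipe_fn(optimizer_type=optimizer_type)
    return recipe_fn()


def kimi_like_recipe(optimizer_type: str = "adam") -> dict:
    """Hypothetical builder that honors the selected optimizer."""
    return {"optimizer": optimizer_type}


def llama_like_recipe() -> dict:
    """Hypothetical builder that takes no arguments."""
    return {"optimizer": "adam"}


forwarded = call_recipe(kimi_like_recipe, optimizer_type="muon")
ignored = call_recipe(llama_like_recipe, optimizer_type="muon")
defaulted = call_recipe(kimi_like_recipe)
```

Silently dropping the argument for builders that do not accept it keeps zero-argument recipes working, at the cost of hiding a mismatch; logging a warning in that branch may be a reasonable middle ground.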
In `@src/megatron/bridge/recipes/deepseek/deepseek_v3_perf.py`:
- Around line 489-495: The H100 recipe is only disabling
cfg.comm_overlap.overlap_grad_reduce but not cfg.ddp.overlap_grad_reduce, so
overlap will remain enabled due to OR logic; update the H100 recipe blocks in
deepseek_v3_perf.py (the sections that call
set_deepseek_v3_pipeline_model_parallel_layout and then _benchmark_common(cfg))
to also set cfg.ddp.overlap_grad_reduce = False alongside
cfg.comm_overlap.overlap_grad_reduce = False; ensure you add the same assignment
in each H100 recipe that currently only sets
cfg.comm_overlap.overlap_grad_reduce so they match the GB300/B300 pattern and
the base deepseek_v3_pretrain_config override is undone.
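The OR behavior described in this comment can be illustrated with a toy model. The classes below are hypothetical stand-ins, and `effective_overlap` simply encodes the reviewer's statement that either flag keeps overlap enabled; the real resolution logic in the library may be more involved.

```python
from dataclasses import dataclass, field


@dataclass
class DDPCfg:
    overlap_grad_reduce: bool = True


@dataclass
class CommOverlapCfg:
    overlap_grad_reduce: bool = True


@dataclass
class Cfg:
    ddp: DDPCfg = field(default_factory=DDPCfg)
    comm_overlap: CommOverlapCfg = field(default_factory=CommOverlapCfg)


def effective_overlap(cfg: Cfg) -> bool:
    """Assumed resolution rule: overlap stays on if either flag is set."""
    return cfg.ddp.overlap_grad_reduce or cfg.comm_overlap.overlap_grad_reduce


# Disabling only one flag leaves overlap enabled.
partial = Cfg()
partial.comm_overlap.overlap_grad_reduce = False
partial_result = effective_overlap(partial)

# Disabling both flags actually turns it off.
full = Cfg()
full.comm_overlap.overlap_grad_reduce = False
full.ddp.overlap_grad_reduce = False
full_result = effective_overlap(full)
```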
In `@src/megatron/bridge/recipes/kimi/kimi_k2_perf.py`:
- Around line 49-50: The dataset sequence-length override is using the wrong
attribute name (cfg.dataset.seq_length) so it has no effect; update all
occurrences in the Kimi perf builders to use cfg.dataset.sequence_length instead
of cfg.dataset.seq_length (leave cfg.model.seq_length as-is), i.e., replace
cfg.dataset.seq_length -> cfg.dataset.sequence_length for each builder
function/config block in kimi_k2_perf.py so the dataset config set in kimi_k2.py
is actually overridden.
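The reason a misspelled attribute "has no effect" without raising is that ordinary Python objects accept new attributes on assignment. A minimal sketch, with a hypothetical `DatasetCfg` and a hypothetical `set_existing` guard that a recipe could use to catch this class of typo:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class DatasetCfg:
    """Hypothetical dataset config whose real field is sequence_length."""

    sequence_length: int = 4096


def set_existing(obj: Any, name: str, value: Any) -> None:
    """Assign only to attributes that already exist; reject typos."""
    if not hasattr(obj, name):
        raise AttributeError(f"{type(obj).__name__} has no attribute {name!r}")
    setattr(obj, name, value)


ds = DatasetCfg()
ds.seq_length = 8192  # no error: this silently creates a brand-new attribute
value_after_typo = ds.sequence_length  # still 4096, the override was lost

try:
    set_existing(ds, "seq_lenght", 8192)  # deliberate typo
    typo_caught = False
except AttributeError:
    typo_caught = True

set_existing(ds, "sequence_length", 8192)  # correct name takes effect
value_after_fix = ds.sequence_length
```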
In `@tests/unit_tests/recipes/test_all_perf_equivalence.py`:
- Around line 171-224: The parity check in _assert_configs_equal is incomplete:
it skips many fields (dataset, optimizer, dist, peft, profiling and additional
model/comm_overlap fields) so ConfigContainer differences can be missed; update
_assert_configs_equal to compare all relevant sub-objects (dataset, optimizer,
dist, peft, profiling) and missing model/comm_overlap attributes (e.g.,
dataset.seq_length, dataset_kwargs, wgrad_deferral_limit,
moe_flex_dispatcher_backend) by either iterating over their dataclass fields or
delegating to _compare_dataclass for each of these members, and ensure
comm_overlap nested objects (like tp_comm_overlap_cfg) are compared deeply
rather than by identity so that old vs new recipe changes are detected.
- Around line 274-282: The helper _make_cases currently swallows ValueError from
_parse_recipe_name causing misnamed "*_config" recipes to be ignored; change it
to fail fast by removing the try/except or re-raising a descriptive error (or
call pytest.fail) when _parse_recipe_name(name) raises, so malformed recipe
names surface in test failures; update the block around _make_cases and the call
to _parse_recipe_name to propagate the exception with a clear message including
the offending name rather than silently passing.
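As a rough sketch of the "iterate over their dataclass fields" suggestion, a recursive comparison can walk nested dataclasses and report every differing leaf by path. This is not the existing `_compare_dataclass` helper; `diff_dataclasses` and the example config classes are hypothetical.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, List


def diff_dataclasses(old: Any, new: Any, path: str = "cfg") -> List[str]:
    """Return a list of 'path: old != new' strings for all differing leaves."""
    both_dataclasses = is_dataclass(old) and is_dataclass(new)
    if both_dataclasses and type(old) is type(new):
        diffs: List[str] = []
        for f in fields(old):
            diffs += diff_dataclasses(
                getattr(old, f.name), getattr(new, f.name), f"{path}.{f.name}"
            )
        return diffs
    # Leaves (and mismatched types) are compared by value, never by identity.
    if old != new:
        return [f"{path}: {old!r} != {new!r}"]
    return []


@dataclass
class DatasetCfg:
    sequence_length: int = 4096


@dataclass
class OptimizerCfg:
    lr: float = 3e-4


@dataclass
class Container:
    dataset: DatasetCfg = field(default_factory=DatasetCfg)
    optimizer: OptimizerCfg = field(default_factory=OptimizerCfg)


old_cfg = Container()
new_cfg = Container(dataset=DatasetCfg(sequence_length=8192))
differences = diff_dataclasses(old_cfg, new_cfg)
no_differences = diff_dataclasses(Container(), Container())
```

Because every field is visited, a newly added sub-config is covered automatically instead of needing a hand-written assertion.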
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 267e13dc-daa7-4963-b049-fa7790d3d428
📒 Files selected for processing (21)
docs/proposals/validate-perf-recipe-parity.md
scripts/performance/run_script.py
src/megatron/bridge/recipes/common.py
src/megatron/bridge/recipes/deepseek/__init__.py
src/megatron/bridge/recipes/deepseek/deepseek_v3_perf.py
src/megatron/bridge/recipes/gpt_oss/__init__.py
src/megatron/bridge/recipes/gpt_oss/gpt_oss_perf.py
src/megatron/bridge/recipes/kimi/__init__.py
src/megatron/bridge/recipes/kimi/kimi_k2_perf.py
src/megatron/bridge/recipes/llama/__init__.py
src/megatron/bridge/recipes/llama/llama31_perf.py
src/megatron/bridge/recipes/llama/llama3_perf.py
src/megatron/bridge/recipes/nemotronh/__init__.py
src/megatron/bridge/recipes/nemotronh/nemotronh_perf.py
src/megatron/bridge/recipes/qwen/__init__.py
src/megatron/bridge/recipes/qwen/qwen3_moe_perf.py
src/megatron/bridge/recipes/qwen_vl/__init__.py
src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py
src/megatron/bridge/recipes/qwen_vl/qwen3_vl_perf.py
tests/unit_tests/recipes/llama/test_llama3_perf.py
tests/unit_tests/recipes/test_all_perf_equivalence.py
|
/ok to test 3d47abd |
|
/ok to test 3859b91 |
|
/ok to test eecbe1e |
|
/claude review |
|
/ok to test 79ec86a |
|
/ok to test 188a5bb |
|
Theo follow-up fix pushed and revalidated. Validated PR head Summary:
Validation counts: No local unit tests were run; this was targeted internal dry-run/config validation plus pre-commit on the changed tree. |
|
/ok to test de96af2 |
ConfigContainer.yaml stopped appearing in CI artifacts because Bridge's maybe_log_and_save_config calls cfg.to_yaml() via plain open(..., "w"), which fails with FileNotFoundError if the parent directory doesn't exist. The failure is swallowed by maybe_log_and_save_config's try/except and the after_script's `cp ... || true`, so the absence is silent.
Both places in set_user_overrides / _set_checkpoint_overrides that assign recipe.logger.save_config_filepath now also create the parent directory.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
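The failure mode and fix in the commit message above can be reproduced in isolation. This is a minimal sketch, not the Bridge code: `write_config` and `set_save_config_filepath` are hypothetical helpers, and the directory layout is made up.

```python
import tempfile
from pathlib import Path


def write_config(path: Path, text: str) -> None:
    """Plain open(..., "w"): raises FileNotFoundError if the parent is missing."""
    with open(path, "w") as f:
        f.write(text)


def set_save_config_filepath(path: Path) -> Path:
    """Hypothetical helper: create the parent directory when the path is assigned."""
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "run_1" / "configs" / "ConfigContainer.yaml"

    # Without the parent directory the write fails; if the caller swallows
    # the exception, the file is simply absent with no visible error.
    try:
        write_config(target, "train_iters: 50\n")
        failed_without_mkdir = False
    except FileNotFoundError:
        failed_without_mkdir = True

    # Creating the parent first makes the same write succeed.
    write_config(set_save_config_filepath(target), "train_iters: 50\n")
    written = target.read_text()
```

`exist_ok=True` keeps the helper safe to call from both assignment sites even when the directory already exists.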
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
PR #2803 Config Dump vs Legacy Run Parity Report
Scope
Compared the finalized legacy performance config path on
Branches
How I Ran It
Fixes Made
Example Diffs Before Fix
Final Result
No old/new dump failures remained.
Validation
Artifacts
Comparison artifacts are in the remote scratch workspace under:
No tokens or environment-specific paths are included in this report. |
…ipes Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
|
/ok to test 41622a5 |
PR #2803 Current-Head Config Dump vs Legacy Run Parity
Validated current PR head
Baseline used for this comparison: the current-head legacy performance path (
Comparison summary:
Gated/inaccessible configs:
Static/check status observed:
Artifacts retained internally under
|
|
/ok to test c48ff31 |
Summary
Consolidate performance benchmark recipes from the old two-level pipeline (`scripts/performance/configs/`) into self-contained flat recipes under `src/megatron/bridge/recipes/<family>/<model>_perf.py`.
Each perf recipe function is fully self-contained: it calls the base pretrain/SFT recipe, overrides parallelism and precision inline, then calls `_benchmark_common()`. This eliminates the old indirection through `WorkloadBaseConfig` + `set_workload_base_configs` + `get_perf_optimized_recipe`.
Models covered
Key changes
- `_benchmark_common()` helper in `common.py` for shared perf overrides (train_iters, timing, `te_rng_tracker` derived from `cuda_graph_impl`)
- `_perf_precision()` helper for bf16/fp8_cs/fp8_mx/nvfp4 mixed precision configs
- Qwen3-VL wrapper configs (`qwen3_vl_30b_a3b_pretrain_config`, `qwen3_vl_235b_a22b_pretrain_config`) delegating to text-only configs
- `run_script.py` updated to support the new flat recipe discovery path
Verification
All 245 recipe pairs pass config parity equivalence test on the cluster:
Test plan
- `tests/unit_tests/recipes/test_all_perf_equivalence.py`: 245/245 pass (old vs new config parity)
- `tests/unit_tests/recipes/llama/test_llama3_perf.py`: Llama3 perf recipe unit tests
Made with Cursor
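The flat recipe shape described in the Summary (base recipe, inline overrides, then shared benchmark overrides) might look roughly like the sketch below. Everything here is a simplified stand-in: the config classes, `base_pretrain_config`, `benchmark_common`, the recipe name, and the specific values are hypothetical and do not mirror the real `ConfigContainer` or `_benchmark_common()`.

```python
from dataclasses import dataclass, field


@dataclass
class ModelCfg:
    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1


@dataclass
class TrainCfg:
    train_iters: int = 300_000
    global_batch_size: int = 512


@dataclass
class Config:
    model: ModelCfg = field(default_factory=ModelCfg)
    train: TrainCfg = field(default_factory=TrainCfg)
    mixed_precision: str = "bf16_mixed"


def base_pretrain_config() -> Config:
    """Hypothetical stand-in for a family's base pretrain recipe."""
    return Config()


def benchmark_common(cfg: Config) -> None:
    """Hypothetical shared perf overrides, e.g. a short benchmark run."""
    cfg.train.train_iters = 50


def example_pretrain_8gpu_h100_fp8cs_config() -> Config:
    """One flat recipe: every override is visible in a single function."""
    cfg = base_pretrain_config()

    # Parallelism and batch size overrides, set inline.
    cfg.model.tensor_model_parallel_size = 2
    cfg.train.global_batch_size = 128

    # Precision override, set inline.
    cfg.mixed_precision = "bf16_with_fp8_current_scaling_mixed"

    # Shared benchmark settings applied last.
    benchmark_common(cfg)
    return cfg


perf_cfg = example_pretrain_8gpu_h100_fp8cs_config()
```

The design trade-off is some repetition across recipe functions in exchange for being able to read a complete benchmark configuration top to bottom in one place.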
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests